Casebook / Side-by-Side Outputs / Not Hidden In My Head
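A task-by-task, output-by-output comparison, so the evaluation is more than just "I think so".
This page lays out every downstream task of every brief: what the prompt is, why it is tested, what is expected, what each of the four answers looks like, and the average score each version received.
If you want to check whether this ranking is well-founded, this is the key evidence page.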
Method Reminder
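Why the comparison here is valid
Because every version gets the same brief, the same task, the same representative query, and the same scoring logic. This is not a comparison of writing style; it compares which creator's generated skill better locks the model onto the correct track under the same task pressure.
- Same brief JSON: the task description, artifact hint, downstream tasks, and trigger query are all identical.
- Same Codex / same model / same machine / same runtime.
- Each task is scored as the average of 3 repeats, so a single chance fluctuation is not treated as a conclusion.
- The page shows the repeat 1 output, but the score shown is the average across repeats.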
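The repeat averaging described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the function name `average_repeats` and the sample scores are hypothetical.

```python
from statistics import mean


def average_repeats(scores_by_task):
    """Map each task id to the mean of its repeat scores."""
    return {task: mean(scores) for task, scores in scores_by_task.items()}


# Hypothetical scores, three repeats per task (not taken from the real runs).
example_scores = {
    "cc_docs": [100, 100, 100],
    "tt_export_blank": [100, 80, 100],
}
averages = average_repeats(example_scores)
for task, avg in averages.items():
    print(task, round(avg, 2))
```

Averaging means that a single off repeat lowers the score without deciding the result, which is why the displayed repeat 1 output can look perfect while the score sits below 100.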
Benchmark Brief
Conventional Commit
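What it tests: extremely compressed, minimal output. Can the creator narrow the skill down to a target as easy to drift from as a "one-line commit"?
Why this task is a meaningful comparison: this is the task type best suited to testing "don't be verbose, don't over-expand, pin the boundaries down". As soon as a skill creator's structure is loose, it bloats the skill.
How it is scored: the format score checks single line, no code block, length, and the commit regex; the semantic score checks whether the type, the breaking marker, and the key nouns are preserved.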
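The four format checks named above (single line, no code block, length, commit regex) might look roughly like this. It is a sketch under stated assumptions: the regex, the list of allowed types, the 72-character limit, and the equal weighting are all illustrative choices, not the benchmark's real values.

```python
import re

# Assumed pattern: type, optional scope, optional "!", then ": " and a description.
COMMIT_RE = re.compile(
    r"^(feat|fix|docs|perf|refactor|test|chore|build|ci|style|revert)"
    r"(\([a-z0-9_-]+\))?(!)?: \S.*$"
)
MAX_LEN = 72       # assumed length limit
FENCE = "`" * 3    # markdown code-fence marker


def format_checks(output):
    """Run each format check and return the individual results."""
    text = output.strip()
    return {
        "single_line": "\n" not in text,
        "no_code_fence": FENCE not in text,
        "length_ok": len(text) <= MAX_LEN,
        "matches_regex": COMMIT_RE.match(text) is not None,
    }


def format_score(output):
    """Equal-weight score in [0, 100]; the weighting is an assumption."""
    checks = format_checks(output)
    return 100 * sum(checks.values()) / len(checks)


print(format_checks("feat(sync)!: replace --team with --workspace"))
```

Keeping the checks separate from the score makes it easy to see which rule an output broke, for example a message that is a valid commit but runs past the length limit.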
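Average semantic accuracy: 100
Average creation time: 78.63s
Average creation tokens: 107,390
Average semantic accuracy: 100
Average creation time: 74.23s
Average creation tokens: 103,860
Average semantic accuracy: 100
Average creation time: 70.27s
Average creation tokens: 91,593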
Task cc_docs
Write the commit message for this change: fix a typo in the README install curl snippet.
docs(readme): fix install curl snippet typo
docs(readme): fix install curl snippet typo
docs(readme): fix typo in install curl snippet
Task cc_fix
Compress this into one Conventional Commit: avoid a startup crash when WEBHOOK_SECRET is missing.
fix(startup): avoid crash when WEBHOOK_SECRET is missing
fix: avoid startup crash when webhook secret is missing
fix: avoid startup crash when webhook secret is missing
fix: avoid startup crash when WEBHOOK_SECRET is missing
Task cc_perf
One-line commit for my diff: cache parsed config during boot and cut cold start time by 35%.
perf(config): cache parsed config during boot to reduce cold start time by 35%
perf: cache parsed config during boot
perf: cache parsed config during boot
perf: cache parsed config during boot
Task cc_breaking
Should this be feat or fix commit? Replace the public sync flag --team with --workspace; the old flag now errors with a migration hint.
feat!: replace public sync flag --team with --workspace
feat(sync)!: replace --team with --workspace
feat(sync)!: replace --team with --workspace
feat(sync)!: replace --team with --workspace
Task cc_feat
Write the commit message: add a --dry-run option to the sync command.
feat(sync): add --dry-run option
feat(sync): add --dry-run option
feat(sync): add --dry-run option
feat(sync): add --dry-run option
Benchmark Brief
Support Ticket Triager
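What it tests: strict JSON classification, expressing a label taxonomy, configurable structure, and the creator's judgment about schema / config / interface metadata.
Why this task is a meaningful comparison: this kind of task is close to real work. It needs accurate classification, stable adherence to the output structure, and some thought about whether the labels will change later.
How it is scored: the format score checks whether the output parses as JSON and whether all keys are present; the semantic score checks the five fields severity, product_area, issue_type, customer_tier, and needs_human_reply one by one.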
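A rough sketch of that scoring, assuming equal weight per field. The function name `score_triage`, the 50-point partial format credit, and the expected labels in the usage example are hypothetical; only the five field names come from the description above.

```python
import json

FIELDS = ("severity", "product_area", "issue_type",
          "customer_tier", "needs_human_reply")


def score_triage(output, expected):
    """Return format and semantic scores (0-100) for one triage output."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return {"format": 0.0, "semantic": 0.0}
    if not isinstance(parsed, dict):
        return {"format": 0.0, "semantic": 0.0}
    keys_ok = all(field in parsed for field in FIELDS)
    hits = sum(parsed.get(field) == expected[field] for field in FIELDS)
    return {
        # Assumption: parseable JSON with missing keys earns half credit.
        "format": 100.0 if keys_ok else 50.0,
        "semantic": 100.0 * hits / len(FIELDS),
    }


# Hypothetical expected labels for one ticket.
expected = {"severity": "sev2", "product_area": "export", "issue_type": "bug",
            "customer_tier": "pro", "needs_human_reply": True}
candidate = json.dumps({**expected, "issue_type": "data_issue"})
print(score_triage(candidate, expected))
```

Under this scheme a free-text answer such as "Priority: High" scores zero on both axes because it never parses, while a single mislabeled field costs one fifth of the semantic score.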
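Average semantic accuracy: 100
Average creation time: 120.85s
Average creation tokens: 174,391
Average semantic accuracy: 94.67
Average creation time: 98.45s
Average creation tokens: 113,622
Average semantic accuracy: 100
Average creation time: 139.72s
Average creation tokens: 264,397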
Task tt_sso
Please triage this support ticket: Tier: enterprise Ticket: Since yesterday's SSO cutover, nobody at Acme can log in. About 140 seats are blocked and payroll closes in 3 hours.
Priority: Critical
Category: Authentication / SSO outage
Impact: High — org-wide login failure affecting about 140 users, time-sensitive payroll risk in 3 hours
Route to: Enterprise support + Identity/SSO engineering immediately
Severity: Sev-1
Suggested handling: Treat as a production outage, open an incident now, and contact the customer on the urgent channel while investigating the SSO cutover.
{"severity":"sev1","product_area":"auth","issue_type":"outage","customer_tier":"enterprise","needs_human_reply":true}
{"severity":"sev1","product_area":"auth","issue_type":"outage","customer_tier":"enterprise","needs_human_reply":true}
{"severity":"sev1","product_area":"auth","issue_type":"outage","customer_tier":"enterprise","needs_human_reply":true}
Task tt_dark_mode
Triage this customer ticket: Tier: free Ticket: Love the product. Can you add dark mode to the dashboard? The white UI is harsh at night.
Priority: Low
Category: Feature request / UI theming
Impact: Low — no breakage, usability preference
Route to: Product feedback backlog
Severity: Sev-4
Suggested handling: Acknowledge the request, note dashboard dark mode demand, and add it to the feature-request queue for prioritization.
{"severity":"sev3","product_area":"ui","issue_type":"feature_request","customer_tier":"free","needs_human_reply":true}
{"severity":"sev3","product_area":"ui","issue_type":"feature_request","customer_tier":"free","needs_human_reply":true}
{"severity":"sev3","product_area":"ui","issue_type":"feature_request","customer_tier":"free","needs_human_reply":true}
Task tt_export_blank
Classify this ticket for the queue: Tier: pro Ticket: CSV export downloads a blank file for March invoices. I tried Chrome and Safari and both are blank.
Priority: High
Category: Billing / Export bug
Impact: Medium-High — invoice CSV export for March is unusable across multiple browsers, likely not client-specific
Route to: Pro support + Billing/Reporting engineering
Severity: Sev-2
Suggested handling: Triage as a reproducible product defect affecting financial exports and investigate March invoice export generation.
{"severity":"sev2","product_area":"export","issue_type":"bug","customer_tier":"pro","needs_human_reply":true}
{"severity":"sev2","product_area":"export","issue_type":"data_issue","customer_tier":"pro","needs_human_reply":true}
{"severity":"sev2","product_area":"export","issue_type":"bug","customer_tier":"pro","needs_human_reply":true}
Task tt_invoice_question
Please triage: Tier: business Ticket: Where do I change the email address that receives our invoices?
Priority: Normal
Category: Billing / Account configuration question
Impact: Low — how-to request, no outage
Route to: Billing support
Severity: Sev-4
Suggested handling: Respond with instructions for updating the billing contact email in account billing settings, and verify required admin permissions if needed.
{"severity":"sev3","product_area":"billing","issue_type":"question","customer_tier":"business","needs_human_reply":true}
{"severity":"sev3","product_area":"billing","issue_type":"question","customer_tier":"business","needs_human_reply":true}
{"severity":"sev3","product_area":"billing","issue_type":"question","customer_tier":"business","needs_human_reply":true}
Task tt_duplicate_events
Queue this support ticket: Tier: enterprise Ticket: Your webhook system delivered duplicate events for about two hours this morning and our downstream system created duplicate shipments.
Priority: Critical
Category: Webhooks / Duplicate event delivery
Impact: High — customer reports downstream duplicate shipments caused by duplicate webhook deliveries for about two hours
Route to: Enterprise support + Integrations/Webhooks engineering immediately
Severity: Sev-1
Suggested handling: Open an incident, investigate delivery logs and retry behavior for the affected window, and coordinate customer mitigation guidance for duplicate processing.
{"severity":"sev1","product_area":"api","issue_type":"data_issue","customer_tier":"enterprise","needs_human_reply":true}
{"severity":"sev2","product_area":"api","issue_type":"data_issue","customer_tier":"enterprise","needs_human_reply":true}
{"severity":"sev1","product_area":"api","issue_type":"data_issue","customer_tier":"enterprise","needs_human_reply":true}
Benchmark Brief
Command Risk Checker
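What it tests: a safety-judgment skill. Can the creator make the product short, strictly constrained, and still reliably distinguish safe / confirm / refuse?
Why this task is a meaningful comparison: this is a typical judgment-type skill, and it is especially good at exposing whether the creator writes rules that are scattered, soft, or not executable enough.
How it is scored: the format score checks for valid JSON and complete keys; the semantic score gives 80% to the verdict and 20% to whether safer_alternative appears as expected.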
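The 80/20 split can be sketched as below. How "appears as expected" is decided is an assumption here: this sketch treats it as a presence check (empty for safe commands, non-empty otherwise), and the name `score_risk` is hypothetical.

```python
import json


def score_risk(output, expected_verdict, expect_alternative):
    """Semantic score: 80 points for the verdict, 20 for safer_alternative."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    verdict_hit = parsed.get("verdict") == expected_verdict
    # Assumption: "as expected" means present when wanted, empty when not.
    has_alternative = bool(str(parsed.get("safer_alternative", "")).strip())
    alternative_hit = has_alternative == expect_alternative
    return 80.0 * verdict_hit + 20.0 * alternative_hit


safe_output = json.dumps({
    "verdict": "safe",
    "reason": "Read-only directory listing of a specific path.",
    "safer_alternative": "",
})
print(score_risk(safe_output, "safe", expect_alternative=False))
```

Weighting the verdict this heavily means a wrong safe / confirm / refuse call cannot be rescued by a well-written alternative.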
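Average semantic accuracy: 100
Average creation time: 101.76s
Average creation tokens: 127,341
Average semantic accuracy: 100
Average creation time: 95.68s
Average creation tokens: 131,832
Average semantic accuracy: 100
Average creation time: 84.57s
Average creation tokens: 142,980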
Task cr_ls
Risk-check this command before I run it: ls -la /var/log/myapp
Low risk if you trust the path. `ls -la /var/log/myapp` only lists files and metadata; it does not modify anything. Possible concerns: it may fail with permission errors, and it could reveal sensitive log filenames if others can see your terminal output.
{"verdict":"safe","reason":"Read-only directory listing of a specific path.","safer_alternative":""}
{"verdict":"safe","reason":"This is a read-only directory listing with low risk.","safer_alternative":""}
{"verdict":"safe","reason":"Read-only directory listing of a log path.","safer_alternative":""}
Task cr_reset
Is this command safe? git reset --hard HEAD~3
Not safe if you might need those last 3 commits or any uncommitted changes. `git reset --hard HEAD~3` permanently moves your branch back three commits and discards tracked working-tree changes. Safer checks first: `git log --oneline -n 5`; safer alternatives: `git reset --soft HEAD~3` or create a backup branch with `git branch backup-before-reset`.
{"verdict":"confirm","reason":"Irreversibly rewrites git history and discards local changes.","safer_alternative":"git log --oneline HEAD~3..HEAD && git branch backup-before-reset && git reset --hard HEAD~3"}
{"verdict":"confirm","reason":"This rewrites Git history and discards uncommitted changes.","safer_alternative":"git status && git log --oneline -3"}
{"verdict":"confirm","reason":"Rewrites local git history and discards working tree changes.","safer_alternative":"git status && git log --oneline -n 5"}
Task cr_rm_root
Quick safety check: rm -rf /
No. `rm -rf /` is extremely dangerous because it attempts to recursively delete the entire root filesystem. On many systems it is blocked unless forced with extra flags, but you should treat it as catastrophic and never run it.
{"verdict":"refuse","reason":"Irreversible root-level filesystem wipe.","safer_alternative":"ls /"}
{"verdict":"refuse","reason":"This attempts broad filesystem deletion at the root and is catastrophically destructive.","safer_alternative":"ls /"}
{"verdict":"refuse","reason":"Recursively wipes the root filesystem.","safer_alternative":"Inspect the specific path first and delete only that target."}
Task cr_k8s_prod
Risk-check this: kubectl delete namespace prod
Very high risk. `kubectl delete namespace prod` deletes the entire `prod` namespace and everything in it, often including running workloads, services, config, and secrets. Only run it if you explicitly intend to destroy that environment. Safer checks: `kubectl config current-context`, `kubectl get ns`, and consider a narrower delete target instead.
{"verdict":"refuse","reason":"Deletes a clearly production Kubernetes namespace.","safer_alternative":"kubectl get namespace prod"}
{"verdict":"refuse","reason":"This deletes a production-like namespace and can remove live resources.","safer_alternative":"kubectl get namespace prod"}
{"verdict":"refuse","reason":"Deletes a production namespace and all contained resources.","safer_alternative":"kubectl get namespace prod && kubectl delete namespace <non-prod> --dry-run=server"}
Task cr_mkdir_copy
Is this command safe to run? mkdir -p tmp/build-cache && cp artifact.tgz tmp/build-cache/
Generally safe. `mkdir -p tmp/build-cache && cp artifact.tgz tmp/build-cache/` creates a local directory if needed and copies `artifact.tgz` into it. Main risks: `artifact.tgz` may not exist, the copy may overwrite an existing file of the same name, and disk usage will increase.
{"verdict":"safe","reason":"Creates a local scratch folder and copies into it.","safer_alternative":""}
{"verdict":"safe","reason":"This only creates a local directory and copies a file into it.","safer_alternative":""}
{"verdict":"safe","reason":"Creates a local cache directory and copies one file into it.","safer_alternative":""}
Task cr_delete_pyc
Safety-check this command: find . -name "*.pyc" -delete
Usually safe, but use care. `find . -name "*.pyc" -delete` removes all Python bytecode files under the current directory tree. It is often fine for cleanup because `.pyc` files are regenerated, but the risk is scope: if you run it from the wrong directory, it will delete more widely than intended. Safer preview: `find . -name "*.pyc"`.
{"verdict":"confirm","reason":"Broad delete pattern removes many files irreversibly.","safer_alternative":"find . -name \"*.pyc\" -print"}
{"verdict":"confirm","reason":"This deletes matched files under the current directory tree.","safer_alternative":"find . -name \"*.pyc\" -print"}
{"verdict":"confirm","reason":"Broad file deletion depends on the current directory scope.","safer_alternative":"find . -name \"*.pyc\" -print"}
Benchmark Brief
Weekly Release Notes
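What it tests: templated markdown summarization, awareness of assets/templates, and whether the creator moves the fixed skeleton to a more suitable place.
Why this task is a meaningful comparison: this kind of task is neither pure classification nor pure free writing. It is a realistic collaborative task of "fixed skeleton + sorting items into categories".
How it is scored: the format score checks heading order, section completeness, and bullet discipline; the semantic score checks whether key tokens land in the correct section and whether empty sections are written as `- None`.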
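The format side of this rubric could be checked along these lines. The heading list is read off the outputs shown on this page; the function name `check_release_note` and the exact definition of "bullet discipline" (every non-heading line is a `- ` bullet) are assumptions.

```python
HEADINGS = ["## Added", "## Fixed", "## Changed", "## Docs"]


def check_release_note(markdown):
    """Check heading order, section completeness, and bullet discipline."""
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip()]
    found = [ln for ln in lines if ln.startswith("## ")]
    sections = {}
    current = None
    bullets_only = True
    for ln in lines:
        if ln.startswith("## "):
            current = ln
            sections[current] = []
        elif ln.startswith("- ") and current is not None:
            sections[current].append(ln)
        else:
            bullets_only = False  # prose or a bullet outside any section
    return {
        "heading_order": found == HEADINGS,
        # An empty section must still carry a bullet such as "- None".
        "sections_complete": all(sections.get(h) for h in HEADINGS),
        "bullets_only": bullets_only,
    }


note = "\n".join([
    "## Added", "- Added CSV export for payouts.",
    "## Fixed", "- Fixed the OAuth token refresh loop.",
    "## Changed", "- None",
    "## Docs", "- Added a new self-hosting guide.",
])
print(check_release_note(note))
```

A single-sentence prose summary fails all three checks, which is the gap this task is designed to expose.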
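Average semantic accuracy: 100
Average creation time: 98.61s
Average creation tokens: 132,387
Average semantic accuracy: 100
Average creation time: 72.89s
Average creation tokens: 95,011
Average semantic accuracy: 100
Average creation time: 77.82s
Average creation tokens: 129,896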
Task rn_mix_a
Turn these merged PR notes into the weekly release note: - Added CSV export for payouts - Fixed OAuth token refresh loop - Docs: new self-hosting guide
This week’s release includes CSV export for payouts, a fix for an OAuth token refresh loop, and a new self-hosting guide in the docs.
## Added
- Added CSV export for payouts.
## Fixed
- Fixed the OAuth token refresh loop.
## Changed
- None
## Docs
- Added a new self-hosting guide.
## Added
- Added CSV export for payouts.
## Fixed
- Fixed an OAuth token refresh loop.
## Changed
- None
## Docs
- Added a new self-hosting guide.
## Added
- Added CSV export for payouts
## Fixed
- Fixed the OAuth token refresh loop
## Changed
- None
## Docs
- Added a new self-hosting guide
Task rn_mix_b
Please normalize these release notes: - Changed the billing page layout to group invoices by month - Fixed duplicate webhook deliveries - Added dry-run mode to sync command
This release updates the billing page to group invoices by month, fixes duplicate webhook deliveries, and adds a dry-run mode to the sync command.
## Added
- Added a dry-run mode to the sync command.
## Fixed
- Fixed duplicate webhook deliveries.
## Changed
- Updated the billing page layout to group invoices by month.
## Docs
- None
## Added
- Added dry-run mode to the sync command.
## Fixed
- Fixed duplicate webhook deliveries.
## Changed
- Changed the billing page layout to group invoices by month.
## Docs
- None
## Added
- Added dry-run mode to the sync command
## Fixed
- Fixed duplicate webhook deliveries
## Changed
- Changed the billing page layout to group invoices by month
## Docs
- None
Task rn_docs_only
Make the weekly release note from these merged items: - Docs: clarify SSO setup steps - Docs: add API retry examples
This week’s documentation updates clarify the SSO setup steps and add API retry examples.
## Added
- None
## Fixed
- None
## Changed
- None
## Docs
- Clarified the SSO setup steps.
- Added API retry examples.
## Added
- None
## Fixed
- None
## Changed
- None
## Docs
- Clarified SSO setup steps.
- Added API retry examples.
## Added
- None
## Fixed
- None
## Changed
- None
## Docs
- Clarified SSO setup steps
- Added API retry examples
Benchmark Brief
Stakeholder Brief
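What it tests: extraction from "messy input -> fixed summary structure", especially how the creator divides the work between heuristics, template, and body.
Why this task is a meaningful comparison: this is the task that best separates creators by skill level. The input is messy, there are many edge cases, and it is easy to summarize by gut feeling, so it is a hard test of instruction architecture.
How it is scored: the format score checks the fixed headings and bullet structure; the semantic score checks whether the four kinds of information (Goal / Decisions / Open Questions / Next Steps) are extracted accurately.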
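One way to picture the semantic check is a per-section phrase match, sketched below. Matching on key phrases is an assumption about how "extracted accurately" might be measured; the names `parse_sections` and `semantic_score` and the expected phrases are hypothetical.

```python
def parse_sections(markdown):
    """Group bullet text under each '## ' heading."""
    sections = {}
    current = None
    for raw in markdown.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:]
            sections[current] = []
        elif line.startswith("- ") and current is not None:
            sections[current].append(line[2:].lower())
    return sections


def semantic_score(markdown, expected):
    """Share of expected phrases found in their own section (0-100)."""
    sections = parse_sections(markdown)
    hits = total = 0
    for name, phrases in expected.items():
        text = " ".join(sections.get(name, []))
        for phrase in phrases:
            total += 1
            hits += phrase.lower() in text
    return 100.0 * hits / total if total else 0.0


brief = "\n".join([
    "## Goal", "- Launch self-serve SSO in May.",
    "## Decisions", "- Keep SCIM out of v1.",
    "## Open Questions", "- Should audit logs be enterprise-only?",
    "## Next Steps", "- Sara drafts the rollout plan by Tuesday.",
])
# Hypothetical key phrases per section.
expected_phrases = {
    "Goal": ["self-serve SSO"],
    "Decisions": ["SCIM"],
    "Open Questions": ["audit logs"],
    "Next Steps": ["Sara"],
}
print(semantic_score(brief, expected_phrases))
```

Because a phrase only counts inside its own section, an output that mentions the right facts under the wrong heading, or under no `## ` heading at all, loses those points.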
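Average semantic accuracy: 86.11
Average creation time: 103.92s
Average creation tokens: 125,870
Average semantic accuracy: 80.55
Average creation time: 95.45s
Average creation tokens: 132,610
Average semantic accuracy: 100
Average creation time: 68.43s
Average creation tokens: 89,581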
Task ib_launch
Normalize these call notes into the standard brief: - Goal is to launch self-serve SSO in May - Decided to keep SCIM out of v1 - Open question: should audit logs be enterprise-only? - Next step: Sara drafts rollout plan by Tuesday
Goal: Launch self-serve SSO in May
Decisions:
- Keep SCIM out of v1
Open Questions:
- Should audit logs be enterprise-only?
Next Steps:
- Sara to draft the rollout plan by Tuesday
## Goal
- Launch self-serve SSO in May.
## Decisions
- Keep SCIM out of v1.
## Open Questions
- Decide whether audit logs should be enterprise-only.
## Next Steps
- Sara drafts the rollout plan by Tuesday.
## Goal
- Launch self-serve SSO in May.
## Decisions
- SCIM is out of scope for v1.
## Open Questions
- Should audit logs be limited to enterprise plans?
## Next Steps
- Sara drafts the rollout plan by Tuesday.
## Goal
- Launch self-serve SSO in May
## Decisions
- Keep SCIM out of v1
## Open Questions
- Should audit logs be enterprise-only?
## Next Steps
- Sara drafts the rollout plan by Tuesday
Task ib_pricing
Turn this stakeholder call dump into the standard brief: - We are trying to reduce enterprise sales friction - Decision: show annual pricing on the website - Decision: legal review not needed for the copy tweak - Next: Max updates the pricing page tomorrow
Goal: Reduce enterprise sales friction
Decisions:
- Show annual pricing on the website
- Legal review is not needed for the copy tweak
Next Steps:
- Max to update the pricing page tomorrow
## Goal
- Reduce enterprise sales friction.
## Decisions
- Show annual pricing on the website.
- Legal review is not needed for the copy tweak.
## Open Questions
- None
## Next Steps
- Max updates the pricing page tomorrow.
## Goal
- Reduce friction in the enterprise sales process.
## Decisions
- Show annual pricing on the website.
- No legal review is needed for the pricing copy update.
## Open Questions
- None
## Next Steps
- Max updates the pricing page tomorrow.
## Goal
- Reduce enterprise sales friction
## Decisions
- Show annual pricing on the website
- Legal review is not needed for the copy tweak
## Open Questions
- None
## Next Steps
- Max updates the pricing page tomorrow
Task ib_research
Please standardize these notes: - Want clearer export limits for pro users - Open question: should we hard-cap rows or queue large exports? - Next step: Priya interviews 5 pro customers next week
Goal: Clarify export limits for pro users
Open Questions:
- Should exports have a hard row cap or queue large exports?
Next Steps:
- Priya to interview 5 pro customers next week
## Goal
- Clarify export limits for pro users.
## Decisions
- None
## Open Questions
- Decide whether to hard-cap rows or queue large exports.
## Next Steps
- Priya interviews five pro customers next week.
## Goal
- Clarify export limits for pro users.
## Decisions
- None
## Open Questions
- Should large exports have a hard row cap or be handled through a queue?
## Next Steps
- Priya interviews five pro customers next week.
## Goal
- Clearer export limits for pro users
## Decisions
- None
## Open Questions
- Should exports hard-cap rows or queue large exports?
## Next Steps
- Priya interviews 5 pro customers next week
File Paths
Related raw files
If you want to dig deeper, you can look at these source data files directly.